Comparative study of the evolution of cancer gene duplications across fish

Abstract
Comparative studies of cancer‐related genes not only provide novel information about their evolution and function but also an understanding of cancer as a driving force in biological systems and species’ life histories. So far, these studies have focused on mammals. Here, we provide the first comparative study of cancer‐related gene copy number variation in fish. Fishes are a paraphyletic group whose last common ancestor is also an ancestor of the tetrapods, and accordingly, their tumour suppression mechanisms should include most of the mammalian mechanisms and also reveal novel (but potentially phylogenetically older) previously undetected mechanisms. We have matched the sequenced genomes of 65 fish species from the Ensembl database with the cancer gene information from the COSMIC database. By calculating the number of gene copies across species using the Ensembl CAFE data (providing species trees for gene copy number counts), we used a less resource‐demanding method for homolog identification. Our analysis demonstrates a masked relationship between cancer‐related gene copy number variation (CNV) and maximum lifespan in fish species, suggesting that a higher number of copies of tumour suppressor genes lengthens lifespan, whereas a higher number of copies of oncogenes shortens it. Based on the positive correlation between the number of copies of tumour suppressors and oncogenes, we show which species have more tumour suppressors in relation to oncogenes. It could be suggested that these species have stronger genetic defences against oncogenic processes. Fish studies could be a largely unexplored treasure trove for understanding the evolution and ecology of cancer, providing novel insights into the study of cancer and tumour suppression, in addition to fish evolution, life‐history trade‐offs, and ecology.


S2. Get Fish Homolog Gene Counts for COSMIC Genes
Introduction
The aim of this document is to obtain copy number counts of fish genes for all of the human cancer-associated genes. The cancer genes are from the COSMIC (version 92) database (see S1). The fish are those that have a sequenced genome in the Ensembl (v104) database and have a CAFE tree associated with them. The copy number counts are obtained by two methods:
1. Downloading the Ensembl CAFE gene duplication species trees for all of the COSMIC genes and extracting the copy numbers from them.
2. Downloading the list of human COSMIC gene orthologs for each fish species represented in the Ensembl database using BioMart and counting the unique confident orthologs in each species for each COSMIC gene.
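The second method (counting unique confident orthologs per species and gene) can be illustrated with a minimal base R sketch. The toy table, its column names (`human_gene`, `species`, `ortholog_id`, `confident`) and the helper `count_confident_orthologs` are hypothetical and only stand in for a BioMart export; they are not the actual Ensembl attribute names or the functions used in the original script.

```r
# Hypothetical BioMart-style export: one row per (human gene, species, ortholog).
# Column names and values are illustrative assumptions, not real Ensembl output.
orthologs <- data.frame(
  human_gene  = c("TP53", "TP53", "TP53", "TP53", "KRAS", "KRAS", "KRAS"),
  species     = c("danio_rerio", "danio_rerio", "danio_rerio",
                  "oryzias_latipes", "danio_rerio", "oryzias_latipes",
                  "oryzias_latipes"),
  ortholog_id = c("g1", "g1", "g2", "g3", "g4", "g5", "g6"),
  confident   = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Count unique confident orthologs for each human gene in each species.
count_confident_orthologs <- function(x) {
  # Keep confident calls only and drop duplicated (gene, species, ortholog) rows.
  keep <- unique(x[x$confident, c("human_gene", "species", "ortholog_id")])
  # Cross-tabulate: rows are human genes, columns are species.
  counts <- table(gene = keep$human_gene, species = keep$species)
  as.data.frame.matrix(counts)
}

copy_numbers <- count_confident_orthologs(orthologs)
print(copy_numbers)
```

The result is a gene-by-species table of counts, which has the same shape as the copy number table extracted from the CAFE trees, so the two methods can be compared directly.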

Functions
The script below uses several custom functions to perform the aforementioned tasks.

Getting CAFE gene trees
Downloading and pre-processing the gene trees from the Ensembl CAFE database.
    # Should the latest data be downloaded for all of the Ensembl IDs in COSMIC?
    # At first run, change it to TRUE.
    redownload <- FALSE

Load trees for all studied species
Next, we prune the trees to keep only the selected species and then extract the gene counts from the pruned trees.

Get ortholog gene counts from Ensembl
The number of human homologs for each gene in the list of selected species is downloaded from Ensembl.
    # Set downloadHumanHomologues to TRUE to download the data.
    # It might fail if Ensembl is not responsive.
    downloadHumanHomologues <- FALSE
    # NB: the directory must exist for saving!
    humanHomologuesFile <- "./data/ensembl/fishCosmicHumanHomologues.rds"
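The download flags above follow a simple caching pattern: query the remote service only when asked, otherwise reuse a previously saved `.rds` file. A minimal sketch of that pattern is shown below. The helper `load_or_download` and the stand-in `fake_fetch` are hypothetical names introduced for illustration; the real script queries Ensembl, which is not reproduced here. Unlike the original, this sketch creates the target directory itself and writes to a temporary directory.

```r
# Hypothetical helper: download only when requested, otherwise read the cache.
load_or_download <- function(cache_file, download, fetch_fun) {
  if (download) {
    # Create the target directory so that saveRDS() does not fail.
    dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
    result <- fetch_fun()
    saveRDS(result, cache_file)
    return(result)
  }
  if (!file.exists(cache_file)) {
    stop("No cached file at ", cache_file,
         "; set the download flag to TRUE for the first run.")
  }
  readRDS(cache_file)
}

# Stand-in for the real Ensembl query (assumption: it returns a data frame).
fake_fetch <- function() {
  data.frame(gene = c("TP53", "KRAS"), copies = c(2L, 3L))
}

cache_file <- file.path(tempdir(), "ensembl", "fishCosmicHumanHomologues.rds")
first  <- load_or_download(cache_file, download = TRUE,  fetch_fun = fake_fetch)
second <- load_or_download(cache_file, download = FALSE, fetch_fun = fake_fetch)
print(identical(first, second))
```

Keeping the flag `FALSE` by default makes repeated runs reproducible and independent of whether the remote service is responsive.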

Introduction
The aim of this supplementum is to establish whether the different methods for obtaining cancer gene copy number counts produce similar results. The methods compared are the method of Tollis et al. (2020) and the Ensembl CAFE and Ensembl ortholog approaches from this paper. The comparison includes those mammal species that are present in all of the datasets.

Read data
Read in the original data from Tollis et al. (2020). There are 33 species that are present in all data frames.

Check calculations
Let us first check whether our method of obtaining normalized gene counts is the same as that used in Tollis et al. (2020). The y-axes represent the data from this paper, while the x-axes are from Tollis et al. (2020).
Figure S1. Linear regression between TSG copy number calculations. The y-axes represent the data calculated in this paper from Tollis' original data, whilst the x-axes are from the paper of Tollis et al. (2020).
Figure S2. Linear regression between oncogene copy number calculations. The y-axes represent the data calculated in this paper from Tollis' original data, whilst the x-axes are from the paper of Tollis et al. (2020).
Figure S3. Linear regression between Somatic gene copy number calculations. The y-axes represent the data calculated in this paper from Tollis' original data, whilst the x-axes are from the paper of Tollis et al. (2020).
Figure S4. Linear regression between gatekeeper copy number calculations. The y-axes represent the data calculated in this paper from Tollis' original data, whilst the x-axes are from the paper of Tollis et al. (2020).
The above plots confirm that (at least for most reported values in Tollis et al. (2020)) the normalisation is performed consistently and repeatably. The difference in Somatic may come from changes in the COSMIC database, or from the fact that the Somatic genes here are only those that have no germline mutation. This can be checked but is largely irrelevant.

Correlations TSGs
Let us see whether the different TSG count measures from Tollis et al. (2020) yield results similar to our Ensembl CAFE and homologue approaches.
Figure S5. Linear regression between TSG copy number calculations. The y-axes represent the data from this paper extracted from Ensembl CAFE or the Homolog approach, whilst the x-axes are from the paper of Tollis et al. (2020). Rows are different COSMIC Tiers (Tier 1 or Tier 1&2) and columns represent the Ensembl CAFE or Homolog approach. The data on the x axis is the same for all plots.
Figure S6. Linear regression between TSG copy number calculations. The y-axes represent the data from this paper extracted from Ensembl CAFE or the Homolog approach, whilst the x-axes are from the paper of Tollis et al. (2020). Rows are different COSMIC Tiers (Tier 1 or Tier 1&2) and columns represent the Ensembl CAFE or Homolog approach. The data on the x axis is the same for all plots.
Figure S7. Linear regression between TSG copy number calculations. The y-axes represent the data from this paper extracted from Ensembl CAFE or the Homolog approach, whilst the x-axes are from the paper of Tollis et al. (2020). Rows are different COSMIC Tiers (Tier 1 or Tier 1&2) and columns represent the Ensembl CAFE or Homolog approach. The data on the x axis is the same for all plots.
From the above plots it is evident that there is a significant positive correlation between the Tollis "TSG_w_Zero" measure and the data extracted from Ensembl CAFE, but it is not very strong. Also, the mammal homolog calculations give a completely different result due to the differences in Ensembl homolog-level calculations between taxa. So the homolog approach should not be used for mammalian data.

Oncogenes
Let us see whether the different oncogene count measures from Tollis et al. (2020) yield results similar to the Ensembl CAFE and homolog approaches. The oncogene results are similar to the TSG results.
Figure S7. Linear regression between oncogene copy number calculations. The y-axes represent the data from this paper extracted from Ensembl CAFE or the Homolog approach, whilst the x-axes are from the paper of Tollis et al. (2020). Rows are different COSMIC Tiers (Tier 1 or Tier 1&2) and columns represent the Ensembl CAFE or Homolog approach. The data on the x axis is the same for all plots.
Figure S8. Linear regression between oncogene copy number calculations. The y-axes represent the data from this paper extracted from Ensembl CAFE or the Homolog approach, whilst the x-axes are from the paper of Tollis et al. (2020). Rows are different COSMIC Tiers (Tier 1 or Tier 1&2) and columns represent the Ensembl CAFE or Homolog approach. The data on the x axis is the same for all plots.

Conclusions
The measures correlate, but the correlations are not strong. Strangely, the "with 0" approach of Tollis et al. (2020) correlates best with the Ensembl CAFE results of this paper. So we can say that the datasets are reasonably different but aim to measure the same thing.

Introduction
This supplementum aims to plot additional graphs and tables for referencing in the main text. These plots and tables are meant to clarify and illustrate the data and analyses and differ from the ones in the main article mainly by including only ray-finned fishes (Class: Actinopterygii).

Lifespan vs size
First, let us plot whether the magnitude of maximum lifespan is related to the magnitude of maximum body length in all fish species. The underlying hypothesis is that species that live longer also tend to be larger.
Figure S9. Linear regression between log transformed maximum body length and maximum lifespan. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S10. Linear regression between log transformed maximum body length and maximum lifespan. Each point in the plot represents a species. The only species included are those for which the lifespan and body size data were extracted from the AnAge database, FishBase or articles. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S11. Linear regression between log transformed maximum body length and average lifespan. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S12. A plot between log transformed maximum body length and longevity quotient (LQ). Each point in the plot represents a species. To calculate the LQ we used the predict function on the model with only reliable ages (i.e. model from figure 2) similarly to .
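The longevity quotient calculation described for the LQ plot can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the variable names (`log_length`, `log_lifespan`, `reliable`) are hypothetical, the LQ is taken to be the ratio of observed to model-predicted lifespan, and ordinary `lm()` stands in for whichever model the original analysis actually passed to `predict`.

```r
set.seed(1)

# Simulated stand-in data (NOT the fish dataset): log body length, log lifespan,
# and a flag marking species with a reliable age estimate.
n <- 40
log_length   <- rnorm(n, mean = 3.5, sd = 0.8)
log_lifespan <- 0.5 + 0.6 * log_length + rnorm(n, sd = 0.3)
reliable     <- rep(c(TRUE, FALSE), length.out = n)
fish <- data.frame(log_length, log_lifespan, reliable)

# Fit the size-lifespan model on the reliable subset only.
fit <- lm(log_lifespan ~ log_length, data = fish[fish$reliable, ])

# Predict the expected log lifespan for every species from its body length.
expected_log <- predict(fit, newdata = fish)

# Assumed definition: LQ = observed lifespan / expected lifespan.
fish$LQ <- exp(fish$log_lifespan) / exp(expected_log)

summary(fish$LQ)
```

An LQ above 1 marks a species that lives longer than expected for its size under the fitted model, and an LQ below 1 a species that lives shorter.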

Copy number variation (CNV) correlations of oncogenes and TSGs
The next set of plots depicts correlations between oncogene and tumour suppressor gene CNVs. In particular, the plots aim to visualize to what extent the obtained results depend on the method of getting copy number counts (CAFE vs ortholog) or on the subset of cancer genes (COSMIC Tier 1 vs COSMIC Tier 1&2).
Figure S13. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogene copy numbers. The CNVs have been obtained using the CAFE approach and both COSMIC Tier 1&2 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S14. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogene copy numbers. The CNVs have been obtained using the ortholog approach and both COSMIC Tier 1&2 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S15. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogene copy numbers. The CNVs have been obtained using the ortholog approach and only COSMIC Tier 1 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.

Cancer gene CNV vs lifespan
One central idea is that a higher number of TSGs and a lower number of oncogenes provide the basis for living longer. In this section we correlate different CNVs with the magnitude of maximum lifespan. The CNVs are calculated using the CAFE approach on COSMIC Tier 1 cancer related genes.
Figure S16. Linear regression between log maximum lifespan and copy numbers of different subsets of cancer related genes (TSG, GateKeeper, CareTaker, Somatic, Germline, Oncogene). The CNVs have been obtained using the CAFE approach and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.

Masked relationship between TSG copy numbers and lifespan
Here we check the main idea of this paper: that the higher number of TSGs in longer-lived fish is masked by the lower number of oncogenes in these fish. The copy numbers are calculated using the CAFE approach on COSMIC Tier 1 cancer related genes.
Figure S17. Linear regression between log maximum lifespan and the residual copy numbers of TSGs. The residual TSG CNVs have been obtained using the phylogenetically adjusted regression between TSG and oncogene CNVs. The CAFE approach is used and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S18. Linear regression between log maximum lifespan and the residual copy numbers of TSGs. The residual TSG CNVs have been obtained using the phylogenetically adjusted regression between TSG and oncogene CNVs. The CAFE approach is used and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
All of the above relationships also hold (and are even more confident) in the dataset with only 36 species (only data from AnAge, fishbase and articles).
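The logic of a masked relationship can be illustrated on simulated data. The sketch below is not the paper's analysis: the numbers are synthetic, the variable names are hypothetical, and ordinary `lm()` is used as a stand-in for the phylogenetically adjusted regression. It shows how a positive TSG effect can nearly vanish when oncogenes are left out of the model, and how regressing lifespan on the residuals of TSG on oncogene counts recovers it.

```r
set.seed(42)

# Synthetic data (NOT the fish dataset): TSG and oncogene copy numbers are
# strongly correlated, and they affect lifespan in opposite directions.
n   <- 200
onc <- rnorm(n)
tsg <- 0.9 * onc + rnorm(n, sd = 0.4)
log_lifespan <- 1.0 * tsg - 1.0 * onc + rnorm(n, sd = 0.5)

# Model with TSG only: the opposing oncogene effect masks the TSG effect.
single <- lm(log_lifespan ~ tsg)

# Model with both predictors: each effect is estimated holding the other fixed.
both <- lm(log_lifespan ~ tsg + onc)

# Residual approach: TSG copy numbers not explained by oncogene copy numbers.
tsg_resid <- resid(lm(tsg ~ onc))
resid_fit <- lm(log_lifespan ~ tsg_resid)

round(c(tsg_only      = unname(coef(single)["tsg"]),
        tsg_with_onc  = unname(coef(both)["tsg"]),
        onc_with_tsg  = unname(coef(both)["onc"]),
        tsg_residuals = unname(coef(resid_fit)["tsg_resid"])), 3)
```

In ordinary least squares the slope on the residualized predictor equals the TSG coefficient of the two-predictor model, which is why the residual plot and the joint model tell the same story; whether this equality carries over exactly to a phylogenetically adjusted fit is not shown here.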

Body size vs cancer gene CNVs
To reproduce the plots in figure 4 of Tollis et al. (2020) on our fish dataset, we correlated the body size measures with normalized cancer related gene counts.
Figure S19. Linear regression between log maximum body size and copy numbers of different subsets of cancer related genes (TSG, CareTaker, GateKeeper, Oncogene, Germline, SomaticAndGermline). The CNVs have been obtained using the CAFE approach and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Also, let us check for a masked relationship, as TSG and oncogene counts are likely evolutionarily constrained. Clearly, we are unable to demonstrate that body length depends on oncogene or TSG copy numbers.

Longevity Quotient (LQ) vs cancer gene CNVs
The phylogenetically informed regression indicates a stronger positive correlation between size and lifespan only on highly confident data (see the section Lifespan vs size above). Therefore, the LQ was first calculated using this model. In the plots we show the same correlations between copy numbers and LQ that we did for maximum body size and lifespan, for only those species for which a reasonably confident age estimate is available.
Figure S20. Linear regression between longevity quotient (LQ) and copy numbers for different subsets of cancer related genes (TSG, CareTaker, GateKeeper, Oncogene, Germline, SomaticAndGermline). The CNVs have been obtained using the CAFE approach and only COSMIC Tier 1 genes are included. The LQs have been obtained using the same method as that of . Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.

Masked relationship between TSG CNV and LQ
Here we check the main idea of this paper: that the higher number of TSGs in longer-lived fish is masked by the lower number of oncogenes in these fish. The CNVs are calculated using the CAFE approach on COSMIC Tier 1 cancer related genes. The estimates are calculated both for the more confident maximum lifespan data and for all available maximum lifespan data.

Conclusions
The analyses above suggest that the magnitude of maximum lifespan of fish species is positively affected by the total copy number of tumour suppressor genes and negatively by the total number of oncogenes. As TSG and oncogene CNVs are strongly correlated, their relationship with lifespan is masked if one of them is excluded from the model. The connection between lifespan and TSG CNV is particularly robust, as it does not depend much on model parameters, on the inclusion or exclusion of some data points, on whether the lifespan is adjusted for body size, or on the method of obtaining CNV counts.

Introduction
This supplementum aims to plot some graphs and tables for referencing in the main text. These plots and tables are meant to clarify and illustrate the data and analyses. The plots differ from the ones in the main article mainly by including only teleost fish that have gone through three rounds of whole-genome duplication (WGD), that is, ray-finned fishes (Class: Actinopterygii) without salmonids and cyprinids.

Lifespan vs size
First, let us plot whether the magnitude of maximum lifespan is related to the magnitude of maximum body length in all fish species. The underlying hypothesis is that species that live longer also tend to be larger.

Copy number variation (CNV) correlations of oncogenes and TSGs
The next set of plots depicts correlations between oncogene and tumour suppressor gene CNVs. In particular, the plots aim to visualize to what extent the obtained results depend on the method used for getting copy number counts (CAFE vs ortholog) or on the subset of cancer genes (COSMIC Tier 1 vs COSMIC Tier 1&2).
Figure S25. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogenes. The copy numbers have been obtained using the CAFE approach and both COSMIC Tier 1&2 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S26. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogenes. The copy numbers have been obtained using the ortholog approach and both COSMIC Tier 1&2 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S27. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogenes. The copy numbers have been obtained using the ortholog approach and only COSMIC Tier 1 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.

Cancer gene CNV vs lifespan
One central idea is that a higher number of TSGs and a lower number of oncogenes provide the basis for living longer. In this section we correlate different CNVs with the magnitude of maximum lifespan. The copy numbers are calculated using the CAFE approach on COSMIC Tier 1 cancer related genes.
Figure S28. Linear regression between log maximum lifespan and copy numbers of different subsets of cancer related genes (TSG, GateKeeper, CareTaker, Somatic, Germline, Oncogene). The copy numbers have been obtained using the CAFE approach and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.

Masked relationship between TSG CNV and lifespan
Here we check the main idea of this paper: that the higher number of TSGs in longer-lived fish is masked by the lower number of oncogenes in these fish. The copy numbers are calculated using the CAFE approach on COSMIC Tier 1 cancer related genes.
Figure S29. Linear regression between log maximum lifespan and the residual copy numbers of TSGs. The residual TSG copy numbers have been obtained using the phylogenetically adjusted regression between TSG and oncogene copy numbers. The CAFE approach is used and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S30. Linear regression between log maximum lifespan and the residual copy numbers of TSGs. The residual TSG copy numbers have been obtained using the phylogenetically adjusted regression between TSG and oncogene copy numbers. The CAFE approach is used and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
All of the above relationships also hold (and are even more confident) in the dataset with only 40 species (only data from AnAge, fishbase and articles).

Body size vs cancer gene CNVs
To reproduce the plots in figure 4 of Tollis et al. (2020) on our fish dataset, we correlated the body size measures with normalized cancer related gene counts.
Figure S31. Linear regression between log maximum body size and copy numbers of different subsets of cancer related genes (TSG, CareTaker, GateKeeper, Oncogene, Germline, SomaticAndGermline). The copy numbers have been obtained using the CAFE approach and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Also, let's check for a masked relationship as TSG and oncogene counts are likely evolutionarily constrained. Clearly, we were unable to demonstrate that body length depends on oncogene or TSG copy numbers.

Longevity Quotient (LQ) vs cancer gene copy numbers
The phylogenetically informed regression indicates a stronger positive correlation between size and lifespan only on highly confident data (see the section Lifespan vs size above). Therefore, the LQ was first calculated using this model. In the plots we show the same correlations between copy numbers and LQ that we did for maximum body size and lifespan, for only those species for which a reasonably confident age estimate is available.
Figure S32. Linear regression between longevity quotient (LQ) and copy numbers of different subsets of cancer related genes (TSG, CareTaker, GateKeeper, Oncogene, Germline, SomaticAndGermline). The copy numbers have been obtained using the CAFE approach and only COSMIC Tier 1 genes are included. The LQs have been obtained using the same method as . Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.

Masked relationship between TSG copy numbers and LQ
Here we check the main idea of this paper: that the higher number of TSGs in longer-lived fish is masked by the lower number of oncogenes in these fish. The copy numbers are calculated using the CAFE approach on COSMIC Tier 1 cancer related genes. The estimates are calculated both for the more confident maximum lifespan data and for all available maximum lifespan data.

Conclusions
The analyses above suggest that the magnitude of maximum lifespan of fish species is positively affected by the total copy number of tumour suppressor genes when species with WGD have been removed. As TSG and oncogene copy numbers are strongly correlated, the relationship of TSGs with lifespan is masked if one of them is excluded from the model. However, the connection between lifespan and TSG copy numbers is not particularly robust, as it does depend somewhat on model parameters, on the inclusion or exclusion of some data points, and on whether the lifespan is adjusted for body size (i.e. longevity quotient) or not.

S6 Copy Number Regressions for Mammal Cancer Genes
Introduction
The idea of this supplementum is to run basic phylogenetically informed regressions on cancer gene copy numbers in mammals. This analysis aims to check whether the masked relationship detected in this paper's fish datasets holds for the mammal dataset. The cancer genes are from the COSMIC (version 92) database (Sondka et al. 2018). The mammals are those that have a sequenced genome in the Ensembl (v104) database and have a CAFE tree associated with them in Ensembl Compara. This report is inspired by the paper on copy number variation (CNV) in mammals by Tollis et al. (2020). While the aforementioned paper used 63 mammal genomes, this report uses 108 species (only 33 species overlap between the 2 datasets).

Data
The copy numbers are obtained for two sets of cancer related genes. The first is the COSMIC Tier 1 genes and the second includes both COSMIC Tier 1 and Tier 2 genes. COSMIC is a manually curated list of human cancer genes that also classifies genes as tumour suppressor genes and oncogenes. COSMIC also provides mutation types such as germline, somatic, or both. We also classified each tumour suppressor gene as either a gatekeeper gene or a caretaker gene according to the list provided by . The copy number count of cancer related genes in mammals in this report is obtained differently from Tollis et al. (2020).
To get the copy numbers of the abovementioned COSMIC genes in different mammal species, two approaches were used. The first one included downloading the Ensembl CAFE species trees for all of the COSMIC genes. The Ensembl CAFE provides species trees for gene copy number counts. For the Ensembl CAFE data, "gene gains and losses in each GeneTree are calculated in by starting from the number of gene copies in each species and using CAFE (Bie et al. 2006) to estimate how many genes existed in each lineage before a speciation event" (cited from ). The second method included downloading the list of human COSMIC gene orthologs for each mammal species represented in the Ensembl database using BioMart (Kinsella et al. 2011) and counting the unique confident orthologs in each species, for each gene. The chosen approach is computationally much less intensive than the one used by Tollis et al. (2020), as it reuses the computational effort from Ensembl. As the supplementary material S1 indicates, the computational methods provide somewhat different CNV estimates. In our opinion, the Ensembl CAFE approach of calculating the gene gains and losses is superior to the approach of Tollis et al. (2020), as it takes into account the phylogenetic relationships of animals in the CNV calculation.
Then the normalized copy number counts for both cancer gene lists (COSMIC Tier 1 and COSMIC Tier 1&2) and for both copy number count methods (CAFE and Ortholog) were calculated according to Tollis et al. (2020). See also code in Baines et al. (2021b).
The maximum length and lifespan data (as well as other parameters) were obtained from the AnAge database (Tacutu et al. 2018). The phylogenetic tree for the mammal species, together with branch lengths, was obtained from www.timetree.org. Species that were missing in the TimeTree database were excluded from the analysis, as phylogenetically informed regressions cannot be done without phylogenetic distances. Of all the mammal species that have a genome in Ensembl (95), only 73 have both adult weight and maximum lifespan given. We further excluded data of low or questionable quality (as assigned by AnAge), which left 70 species.
Analysis on the Ensembl CAFE mammal dataset
NB! All of the following graphs display all individual species as data points (N), and the regression line (with intervals) is calculated using the lm() function. However, the R2 and p values are the results from phylogenetically informed regression (function pgls from package caper). Hence, comparing the visuals on the plot with the R2 and p values is somewhat misleading.

Lifespan vs size
First, we look at whether the maximum lifespan is related to maximum body size in all species. The underlying hypothesis is that species that live longer also tend to be larger (on a log scale).
Figure S33. Linear regression between log transformed maximum body mass and maximum lifespan. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
So the direction is the same as in fish or in the dataset of ., except that we have more data available. Note how the human is a severe outlier. From this correlation we calculated the longevity quotient according to .

Cancer gene counts
To see the correlations between types of different COSMIC gene counts we correlated the normalized copy number counts of different cancer gene measures.

Correlations between different cancer gene counts
First, we looked at whether the obtained results depended on the method of getting copy number counts (CAFE vs ortholog) or on the subset of cancer genes (COSMIC Tier 1 vs COSMIC Tier 1&2).
Figure S34. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogenes. The copy numbers have been obtained using the CAFE approach and both COSMIC Tier 1&2 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Figure S35. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogenes. The copy numbers have been obtained using the homolog approach and both COSMIC Tier 1&2 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
Looking at the two plots above, it seems that the positive correlation between oncogene and tumour suppressor gene copy numbers is present regardless of the method of getting cancer gene copy number counts. The same can be said about the subset of genes used (plots not shown above but below and in subsequent chapters). The results are extremely similar for Tier 1 and for both tiers combined.
Figure S36. Linear regression between copy numbers of different subsets of tumour suppressor genes (all TSGs, gatekeeper genes and caretaker genes) and oncogenes. The copy numbers have been obtained using the homolog approach and COSMIC Tier 1 genes are included. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R2, p and N are from phylogenetically adjusted regression.
However, the total number of species in the analysis is larger for CAFE, as the ortholog approach failed to produce copy number counts for some species. So we chose CAFE to retain more species in the further analyses, and Tier 1 because this restricts the analysis to the most studied and validated cancer genes.

Copy number variation (CNV) and lifespan
One central idea is that a higher number of TSGs and a lower number of oncogenes provide the basis for a longer life. Let's correlate the different copy number counts with maximum lifespan.
Figure S37. Linear regression between log maximum lifespan and copy numbers of different subsets of cancer-related genes (TSG, SomaticAndGermline, Oncogene). The copy numbers were obtained using the CAFE approach, and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R², p and N are from phylogenetically adjusted regression.
All of these correlations are non-significant. However, as oncogenes and TSGs should, in general, have opposite relationships with lifespan, it is worthwhile to put both into the model.
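The rationale for putting both predictors into one model can be illustrated with a minimal Python sketch on simulated data. Everything in it is an assumption made for illustration: the variable names, effect sizes and sample size are invented, the values are not taken from the fish or mammal datasets, and ordinary least squares is used here rather than phylogenetically adjusted regression. Two positively correlated predictors with opposite effects each show a weak slope on their own, while the joint model recovers both effects.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients (intercept first) for predictors X and response y."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Simulated (hypothetical) data: both copy numbers share a common component,
# e.g. an overall tendency of a genome to retain duplicates.
rng = np.random.default_rng(0)
n = 500
shared = rng.normal(size=n)
tsg = shared + 0.3 * rng.normal(size=n)   # TSG copy number (standardised)
onc = shared + 0.3 * rng.normal(size=n)   # oncogene copy number (standardised)

# Assumed opposite effects on (log) lifespan: +1 for TSGs, -1 for oncogenes.
lifespan = 1.0 * tsg - 1.0 * onc + 0.5 * rng.normal(size=n)

# Single-predictor models: each slope is close to zero (the effect is masked).
slope_tsg_alone = ols(tsg, lifespan)[1]
slope_onc_alone = ols(onc, lifespan)[1]

# Joint model: both slopes are recovered with their opposite signs.
_, slope_tsg_joint, slope_onc_joint = ols(np.column_stack([tsg, onc]), lifespan)

print(f"TSG alone: {slope_tsg_alone:.2f}, oncogene alone: {slope_onc_alone:.2f}")
print(f"TSG joint: {slope_tsg_joint:.2f}, oncogene joint: {slope_onc_joint:.2f}")
```

In this toy setting the masking arises purely from the positive correlation between the two predictors, which mirrors the positive correlation between TSG and oncogene copy numbers reported above.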

Oncogene masked relationship with TSG copy numbers
Let's check whether the oncogene copy number masked relationship between lifespan and TSG copy numbers that we detected in the fish dataset is also detectable in the mammal dataset. The results above suggest an inverted masked relationship in mammals and are similar to the results in  finding a positive correlation between lifespan and oncogene count. This holds even if we add body size and reproductive effort to the model. Note that the CAFE approach to obtaining gene counts takes the species tree into account, so one might argue that standard linear regression would be justified for checking the oncogene copy number masked relationship between lifespan and TSG copy numbers. However, the main reason for using a phylogenetically adjusted model is to include the relative distances between speciation events (i.e. to take the branch lengths into account).
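How branch lengths can enter a regression is sketched below in Python as a generalised least squares (GLS) fit. This is only an illustration under stated assumptions, not necessarily the model fitted in this document: it assumes a Brownian motion model of trait evolution, in which the covariance between two species equals the branch length they share from the root, and the four-species tree, the covariance matrix `V` and all data values are hypothetical.

```python
import numpy as np

def gls(X, y, V):
    """GLS coefficients (intercept first) given a species covariance matrix V."""
    design = np.column_stack([np.ones(len(y)), X])
    V_inv = np.linalg.inv(V)
    lhs = design.T @ V_inv @ design
    rhs = design.T @ V_inv @ y
    return np.linalg.solve(lhs, rhs)

# Hypothetical ultrametric tree ((A:1,B:1):2,(C:1.5,D:1.5):1.5) with depth 3.
# Diagonal: root-to-tip distance. Off-diagonal: shared branch length.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.5],
    [0.0, 0.0, 1.5, 3.0],
])

# Made-up trait values for the four species (not real data).
copy_number = np.array([1.0, 1.2, 3.0, 3.5])
log_lifespan = np.array([2.1, 2.0, 3.4, 3.1])

# Phylogenetically weighted fit versus a fit that ignores the tree
# (an identity covariance matrix reduces GLS to ordinary least squares).
beta_phylo = gls(copy_number, log_lifespan, V)
beta_plain = gls(copy_number, log_lifespan, np.eye(4))

print("phylogenetic GLS:", beta_phylo)
print("ordinary LS:     ", beta_plain)
```

Closely related species (A and B, or C and D) contribute less independent information than distantly related ones, which is the practical effect of taking branch lengths into account.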

Humans
As the human might be a confounding observation in this dataset, being the species that all others are compared to and having an extended lifespan, let us exclude it. Clearly, the human data point has a huge impact on the observed results.
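The effect that a single observation can have on a regression slope can be sketched in Python with a leave-one-out refit. The numbers are invented for illustration and are not taken from the mammal dataset, and ordinary least squares is used instead of a phylogenetically adjusted model.

```python
import numpy as np

def slope(x, y):
    """Slope of the ordinary least squares line of y on x."""
    return np.polyfit(x, y, 1)[0]

def leave_one_out_slopes(x, y):
    """Slopes obtained by refitting with each observation removed in turn."""
    idx = np.arange(len(x))
    return np.array([slope(x[idx != i], y[idx != i]) for i in idx])

# Made-up data: five points without any trend plus one extreme observation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y = np.array([2.0, 1.0, 2.0, 1.0, 2.0, 9.0])

full_slope = slope(x, y)
loo_slopes = leave_one_out_slopes(x, y)

print(f"slope with all points: {full_slope:.2f}")
print(f"slope without the last point: {loo_slopes[-1]:.2f}")
```

If the slope changes substantially when one species is dropped, as it does for the last point in this toy example, the conclusion rests largely on that species.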

Body size vs cancer gene counts
To reproduce the plots in Fig. 4 of  we correlated the body size measures with normalized cancer-related gene counts.
Figure S38. Linear regression between log maximum body size and copy numbers of different subsets of cancer-related genes (TSG, SomaticAndGermline, Oncogene, CareTaker, GateKeeper, Germline). The copy numbers were obtained using the CAFE approach, and only COSMIC Tier 1 genes are included. Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R², p and N are from phylogenetically adjusted regression.
All of the results are similarly non-significant.

Longevity Quotient (LQ) vs cancer gene copy number counts
Next we looked at the longevity quotient results. The LQ was calculated as in the article of . Figure S39. A plot between log transformed body mass and longevity quotient (LQ). Each point in the plot represents a species. To calculate the LQ we used the lifespan to mass model (i.e. model from figure 1) similarly to .
Clearly, the human stands out, living longer than expected for its body size.
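A minimal Python sketch of a longevity quotient calculation follows. It assumes the common definition of LQ as observed lifespan divided by the lifespan expected from a log-log regression of lifespan on body mass; the function name and the example values are hypothetical, and the ordinary least squares fit stands in for the lifespan to mass model referred to above.

```python
import numpy as np

def longevity_quotient(mass, lifespan):
    """Observed lifespan divided by the lifespan expected for the body mass.

    The expectation comes from an ordinary least squares fit of
    log(lifespan) on log(mass) across all species passed in.
    """
    log_mass = np.log(mass)
    log_life = np.log(lifespan)
    # np.polyfit returns the highest-degree coefficient first.
    slope, intercept = np.polyfit(log_mass, log_life, 1)
    expected = np.exp(intercept + slope * log_mass)
    return lifespan / expected

# Made-up species: lifespan follows a power law of mass, except that
# the species at index 3 lives three times longer than that law predicts.
mass = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
lifespan = 5.0 * mass ** 0.2
lifespan[3] *= 3.0

lq = longevity_quotient(mass, lifespan)
print(np.round(lq, 2))
```

An LQ above 1 marks a species that lives longer than expected for its body mass, and an LQ below 1 marks one that lives shorter.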
Next, we plot the same correlations between gene copy numbers and LQ as we did for maximum body size and lifespan.
Figure S40. Linear regression between longevity quotient (LQ) and copy numbers of different subsets of cancer-related genes (TSG, SomaticAndGermline, GateKeeper, Germline, CareTaker, Oncogene). The copy numbers were obtained using the CAFE approach, and only COSMIC Tier 1 genes are included. The LQs were obtained using the same method as in . Each point in the plot represents a species in the dataset. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R², p and N are from phylogenetically adjusted regression.
None of the correlations between copy numbers and LQ is very strong.

Oncogene masked relationship with TSG copy numbers
Finally, let's check whether the oncogene copy number masked relationship between LQ and TSG copy numbers that we detected in the fish dataset is also detectable in the mammal dataset. Again, the results suggest the opposite relationship to that observed in fish. Let's exclude the human datapoint, as it might be misleading in this context. First, the maximum human lifespan is inflated by a much larger sample size and much better hospital care compared to other animals. Second, as the dataset is compiled relative to the human cancer gene count, the copy number of each gene for humans is effectively 1. After removing the human data, the result again becomes non-significant.

Analysis on the mammal dataset compiled by Tollis et al. (2020)
This section aims to verify whether the masked relationship linking TSG and oncogene copy numbers to lifespan also holds in the data put together by Tollis et al. (2020). To be more conservative, phylogenetically informed regression is used, similar to .

Read data
There are 63 species in the data frame of Tollis et al. (2020).

Lifespan relation to cancer genes
The first plot shows the phylogenetically informed regression.
Figure S41. Linear regression between log-transformed maximum body mass and maximum lifespan. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R², p and N are from phylogenetically adjusted regression. This is the same plot as in .
However, adding body mass to the model removes the association between lifespan and TSG or oncogene copy numbers.

Using Ensembl only genomes
It is interesting to note, however, that if we keep only those mammals with a genome assembly available in Ensembl (having a genome in Ensembl may be considered an indication of rather good genome quality), the phylogenetically informed regression reveals the same masked relationship in the Tollis et al. (2020) dataset that we discovered in our fish dataset. In this subset of the data, species with longer lifespans have fewer oncogenes and more TSGs, and the relationship is more likely to be true for TSGs. Notably, some of the largest mammals (whales) are missing from this subset. See the plot below.
Figure S42. Linear regression between log-transformed maximum body mass and maximum lifespan. Each point in the plot represents a species. The line and the confidence intervals depicted in the plot come from standard linear regression; the values R², p and N are from phylogenetically adjusted regression. This is the same plot as in , restricted to the species that have a genome in Ensembl.
The relationship for TSGs also holds when body mass is taken into account. In previous sections we checked for the masked relationship in the full Ensembl CAFE dataset of mammals (using CNV calculations from this paper). It is notable that the overlap between the mammal genomes available in Ensembl and the genomes used in the Tollis et al. (2020) dataset is relatively small (33 species are in both datasets).

Cancer gene copy numbers and longevity quotient
Finally, let's check whether the overlapping dataset (33 species) displays a masked relationship between TSGs and LQ, as we saw in the fish dataset. The overlapping dataset of 33 mammal species does not indicate that the longevity quotient is affected by the TSG and oncogene counts. So we also failed to demonstrate a clear masked relationship (as observed in our fish genomes dataset) in the Tollis et al. (2020) dataset of mammals.

Conclusions
We were unable to demonstrate a masked relationship between the number of copies of TSGs and oncogenes and maximum lifespan in mammal species from Ensembl (104) using the CAFE gene counts and the age and size data from AnAge. On the contrary, an inverted relationship can be observed if the human datapoint is included in the model. However, when the human is removed from the data, because the data collection for human lifespan and cancer gene count differs from that for other mammals, these results become less reliable. One possible explanation for why the masked relationship between the number of copies of TSGs and lifespan does not hold for mammals is the relatively small phylogenetic distance between mammal species compared to the distances between fish species. It might be that such a relationship emerges only at a large phylogenetic scale. Another possible explanation is that the cancer genes that have an ortholog in fish are the most conserved and/or the most important ones for lifespan.